Who We Are
Bloomberg is the global leader in business and financial data. Providing real-time and historical market data to our customers - reliably, accurately, and quickly - is at the heart of what we do, and the Ticker Plant system is the core that makes it happen. Our system processes hundreds of billions of unique market events every single day. We ingest and process events from hundreds of exchanges and thousands of other financial institutions, 24 hours a day, around the world, on millions of financial instruments across all asset classes, including stocks, bonds, commodities, currencies, and crypto. After these events have been normalized and enriched by our systems, we disseminate the corresponding updates to our clients in real time. In addition, we respond to billions of requests for current snapshot and historical data every day, served from our petabytes of recorded market history, to which we add terabytes of new data daily.
The SRE team is central to Ticker Plant's success! We are engineers whose expertise centers on the emergent properties of a large-scale, distributed, real-time market data system. Our mission aligns with our customers' expectations, and we focus on the characteristics of the system they care about, namely:
• Correctness - the data a customer sees should accurately reflect the marketplace
• Performance - real-time latencies should be minimized; requests should be served without delay
• Availability - system components will fail; in a sufficiently large system, parts of it are failing all the time. The system as a whole, however, should not fail.
At the scale at which we operate, we cannot achieve these goals without sophisticated monitoring, proactive management, and automated response mechanisms. Thus, we concern ourselves with latency analysis, capacity management, cluster organization, deployment and configuration, fault tolerance, and telemetry. In addition to developing software, we also advise our partner component teams on the development of resilient software, and we analyze and fix system failures as they happen.
What's in it for you:
• Design and develop predictive data models for our system capacity
• Build systems capable of early detection of issues through metrics and signals, and develop automated correction and remediation strategies
• Develop Python/C++ services, libraries and tools that implement our designs
• Proactively scale our services to stay ahead of ever-increasing market data demands by driving capacity planning, instrumentation and performance analysis
• Define service level objectives and apply them to drive measurable service improvement
• Manage entire projects, including meeting with partners and building implementation plans
• Share your accomplishments at internal forums and speak at industry conferences (e.g. SRECon)
We'll trust you to:
• Code - to read, debug, and write production-quality code.
• Design - write code that integrates with components across the entire system, often in collaboration with component teams. This involves assessing workflows, designing appropriate interfaces that provide consistent access to vital functionality, and then building applications that can perform many workflows.
• Analyze - SRE is concerned with the behavior of our system. We are often asked to assess the impact of potential changes before they reach production, or to determine why the system is not behaving as expected.
You'll need to have:
• 4+ years working with an object-oriented programming language (C/C++, Python, Java, etc.)
• A degree in Computer Science, Engineering, Mathematics, or a similar field of study, or equivalent work experience
• An understanding of Computer Science fundamentals such as data structures and algorithms
• Prior contributions to the design, architecture, and scaling of fault-tolerant, distributed systems
• An honest approach to problem-solving and the ability to collaborate with peers, partners, and management
We'd love to see:
• Comfort with data analysis and quantitative decision-making
• Monitoring - assessing system health and performance, understanding SLIs and SLOs and alerting mechanisms
• Distributed systems - heterogeneity, fault tolerance, network and node failure, local inconsistencies (delays in convergence of shared state)
• Cluster management - clusters, deployments, staging, configuration management, A/B testing
• Workflow automation through orchestration
• Operating systems - processes, threads, and scheduling; file systems; memory management; performance tuning. Knowledge of Linux or another POSIX-based system is especially useful
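As a small illustration of the monitoring and SLI concepts listed above, the sketch below computes a latency SLI (the fraction of requests served within a threshold) from raw samples. The threshold and sample data are hypothetical, not real service numbers:

```python
# Hypothetical latency SLI: fraction of requests served within a threshold.
# Sample data and the 25 ms threshold are illustrative only.
import statistics

def latency_sli(latencies_ms: list, threshold_ms: float) -> float:
    """Return the fraction of requests at or under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic observed; treat the window as fully "good"
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

samples = [12.0, 8.5, 30.2, 4.1, 55.0, 9.9, 14.3, 7.2]
sli = latency_sli(samples, threshold_ms=25.0)
print(f"SLI: {sli:.2%} of requests under 25 ms "
      f"(median latency {statistics.median(samples)} ms)")
```

An SLI like this becomes actionable once paired with an SLO (a target such as "99% of requests under 25 ms over 30 days") and alerting on the gap between the two.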